Required packages

For this analysis, we need the following packages:

library(MASS)
library(ggplot2)
library(corrplot)
library(dplyr)
library(tidyr)
library(car)
library(psych)
library(plyr)  # note: loading plyr after dplyr masks some dplyr verbs
library(olsrr)
library(grid)
library(gridExtra)

I. Data Exploration

First, let’s take a quick look at the dataframe and its structure.

head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Boston$chas <- as.factor(Boston$chas)

The dataset contains 506 observations and 14 variables, including:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.

Are there any NAs in our data?

colSums(is.na(Boston))
##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0       0 
## ptratio   black   lstat    medv 
##       0       0       0       0

Let’s also check the data for outliers. According to the boxplot charts, there are many outliers in variables such as ‘black’, ‘crim’, ‘zn’ and ‘rm’.

Boston[,-4] %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = '',y = value)) +
  geom_boxplot(outlier.colour = "cornflowerblue", outlier.shape = 1) +
  facet_wrap(~ var, scales = "free") 
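
For reference, the points flagged by geom_boxplot follow the usual boxplot rule: an observation is drawn separately when it lies more than 1.5 interquartile ranges beyond the quartiles. A minimal base-R sketch of the same rule (toy vector, not the Boston data):

```r
# Boxplot outlier rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
x <- c(1, 2, 3, 4, 100)
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]  # flags 100
```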

***

We can take a look at the distribution of variables by plotting their histograms.

Predictor histograms provide the following information:

  • most variables have skewed distributions (‘crim’, ‘zn’, ‘black’ and ‘age’)
  • ‘rad’ and ‘tax’ appear bimodal, with two distinct peaks
  • ‘rm’ is approximately normally distributed

multi.hist(Boston[,-4])


Let’s build a scatterplot to have a quick look at the data.

plot(Boston[, -c( 2, 4, 6, 12)], col = 'cornflowerblue')

Summary: A brief analysis of the data provides the following observations:

  • there are no missing values in the dataset
  • many variables are not normally distributed, and their boxplots show numerous outliers
  • the last plot deserves particular attention: the pairwise comparisons of medv with the remaining variables suggest, at first glance, that most of these relationships are nonlinear

II. Fitting a multiple linear model

Standardisation

Before proceeding with the selection of the optimal linear model, we standardize the non-factor variables, since they are measured in different units.
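
The scale() call used below centers each column and rescales it to unit variance; a minimal base-R sketch of the same transformation:

```r
# Standardise a numeric vector by hand: z = (x - mean(x)) / sd(x)
standardise <- function(x) (x - mean(x)) / sd(x)

x <- c(2, 4, 6, 8)
z <- standardise(x)

# scale() produces the same values (up to matrix attributes)
all.equal(as.numeric(scale(x)), z)
mean(z)  # 0 (up to floating-point error)
sd(z)    # 1
```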

# Standardisation

bost_stand <- as.data.frame(sapply(Boston[, -4], scale))
head(bost_stand)
##         crim         zn      indus        nox        rm        age      dis
## 1 -0.4193669  0.2845483 -1.2866362 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4169267 -0.4872402 -0.5927944 -0.7395304 0.1940824  0.3668034 0.556609
## 3 -0.4169290 -0.4872402 -0.5927944 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4163384 -0.4872402 -1.3055857 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4120741 -0.4872402 -1.3055857 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4166314 -0.4872402 -1.3055857 -0.8344581 0.2068916 -0.3508100 1.076671
##          rad        tax    ptratio     black      lstat       medv
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990  0.1595278
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324  1.3229375
## 4 -0.7521778 -1.1050216  0.1129203 0.4157514 -1.3601708  1.1815886
## 5 -0.7521778 -1.1050216  0.1129203 0.4406159 -1.0254866  1.4860323
## 6 -0.7521778 -1.1050216  0.1129203 0.4101651 -1.0422909  0.6705582

Building a full model

A full regression model was built for ‘medv’; its summary is shown below. The most significant variables in the standardized data are ‘lstat’, ‘ptratio’ and ‘rm’. The same holds for the non-standardized dataset, so further model selection was carried out on the original data (Boston). The smallest contributions to the model’s predictive ability come from ‘age’ and ‘indus’. The fraction of variance explained by the full model (adjusted R-squared) is 0.7291.

# Full model with all predictors
full_model_stand <- lm(medv~ ., data = bost_stand)
full_model <- lm(medv~ ., data = Boston)
summary(full_model_stand)
## 
## Call:
## lm(formula = medv ~ ., data = bost_stand)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45663 -0.30556 -0.07019  0.20812  2.86780 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.343e-16  2.314e-02   0.000 1.000000    
## crim        -1.058e-01  3.097e-02  -3.417 0.000686 ***
## zn           1.193e-01  3.511e-02   3.398 0.000734 ***
## indus        3.007e-02  4.603e-02   0.653 0.513889    
## nox         -2.188e-01  4.852e-02  -4.509 8.13e-06 ***
## rm           2.942e-01  3.219e-02   9.137  < 2e-16 ***
## age          8.520e-03  4.073e-02   0.209 0.834407    
## dis         -3.401e-01  4.606e-02  -7.383 6.64e-13 ***
## rad          3.108e-01  6.300e-02   4.934 1.10e-06 ***
## tax         -2.521e-01  6.901e-02  -3.653 0.000287 ***
## ptratio     -2.333e-01  3.093e-02  -7.542 2.25e-13 ***
## black        9.670e-02  2.686e-02   3.600 0.000351 ***
## lstat       -4.147e-01  3.965e-02 -10.459  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5205 on 493 degrees of freedom
## Multiple R-squared:  0.7355, Adjusted R-squared:  0.7291 
## F-statistic: 114.3 on 12 and 493 DF,  p-value: < 2.2e-16
summary(full_model)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas1        2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Selection of significant predictors

The least contributing predictors were discarded from the full model. We can see that the adjusted R-squared increased only slightly, to 0.7348.

model_2 <- lm(formula = medv ~ . - indus - age , data = Boston)
summary(model_2)
## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas1         2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

We then diagnosed the resulting model to understand whether it could be improved, and to check the applicability conditions of our model.

III. Model Diagnostics

We have selected a linear model whose predictors all make a significant contribution to its predictive ability. However, before using this model for predictions, it must be diagnosed and adjusted.

Checking the linearity of the relationship

The graph below shows that many variables are nonlinearly related to medv. We are especially interested in lstat (shown on a separate plot), as it is one of the most influential variables and its relationship with medv is parabolic. Therefore, we correct the model by adding I(lstat^2).

Boston[,-4] %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value, y = medv)) +
  geom_point() +
  stat_smooth() +
  facet_wrap(~ var, scales = "free")

ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  stat_smooth()
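
The effect of a quadratic correction can be sanity-checked on synthetic data (not the Boston set): a straight line underfits a parabolic relationship, while adding an I(x^2) term recovers it.

```r
# Sketch: a squared term captures curvature that a straight line misses
set.seed(1)
x <- runif(200, 0, 10)
y <- 50 - 6 * x + 0.3 * x^2 + rnorm(200, sd = 1)

fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

summary(fit_lin)$adj.r.squared   # lower
summary(fit_quad)$adj.r.squared  # higher
anova(fit_lin, fit_quad)         # the I(x^2) term is significant
```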

Residual Analysis

The residuals plot shows that they are unevenly distributed. Heteroscedasticity is present, meaning that prediction errors differ across ranges of predicted values. This is not surprising, since the relationship between the predictors and the dependent variable is not linear.

The quantile plot also looks poor, and we see the following:

  • the variance is not constant, i.e. the constant-variance assumption is not fully met
  • the q-q plot shows that the distribution is not quite normal and is slightly skewed to the right
  • outliers are observed

model_2_diag <- data.frame(fortify(model_2), Boston[, c(3, 7)])  # add back indus and age for the residual plots

gg_medv_2 <- ggplot(data = model_2_diag, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")
gg_medv_2
## `geom_smooth()` using formula 'y ~ x'

qqPlot(model_2_diag$.stdresid)

## [1] 369 372
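
A quick numerical complement to the residuals plot is a Breusch-Pagan-style check: regress the squared residuals on the fitted values; a clear slope means the error variance changes with the predicted value. A base-R sketch on synthetic heteroscedastic data (names and data here are illustrative, not the Boston model):

```r
# Breusch-Pagan-style score check for non-constant variance
set.seed(3)
x <- runif(200, 1, 10)
y <- 2 * x + rnorm(200, sd = x)  # noise grows with x
fit <- lm(y ~ x)

aux <- lm(resid(fit)^2 ~ fitted(fit))
bp_stat <- length(x) * summary(aux)$r.squared  # ~ chi-squared, df = 1
pchisq(bp_stat, df = 1, lower.tail = FALSE)    # small p: heteroscedastic
```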

We also built these plots against the predictors that were not included in the model (‘age’ and ‘indus’). They show no unaccounted-for dependencies.

res_1 <- gg_medv_2 + aes(x = age)
res_2 <- gg_medv_2 + aes(x = indus)
res_1
## `geom_smooth()` using formula 'y ~ x'

res_2
## `geom_smooth()` using formula 'y ~ x'


Cook’s distance graph

We can use Cook’s distance to examine whether an observation is a potential outlier or an influential point. Judging by the graph, there is a significant number of outliers. However, without information about how these observations were obtained, we cannot discard them from the dataset.

ols_plot_cooksd_bar(model_2)
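
Base R computes the same statistic directly, and a common rule of thumb (also the default in olsrr’s bar chart) flags observations with D_i > 4/n. A sketch on synthetic data with one planted influential point:

```r
# Cook's distance with the 4/n rule-of-thumb cutoff
set.seed(42)
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)
d$y[1] <- 15  # plant one influential outlier

fit <- lm(y ~ x, data = d)
cd  <- cooks.distance(fit)

threshold <- 4 / nrow(d)
which(cd > threshold)  # the planted point is among those flagged
```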


Multicollinearity check

The multicollinearity check showed that our data have a very high level of correlation among the predictors. Correlation between predictors does not make the model unusable, but it is better to remove predictors that are highly correlated with others. A VIF of 1 indicates a variable is uncorrelated with the other predictors, a VIF between 2 and 5 indicates moderate correlation, and a VIF above 5 indicates high correlation. The variables tax and rad have the highest VIF values, and the correlation coefficient between them (see the corrplot) is 0.91. Let’s try removing the tax and rad variables from the analysis.

vif(full_model)
##     crim       zn    indus     chas      nox       rm      age      dis 
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
##      rad      tax  ptratio    black    lstat 
## 7.484496 9.008554 1.799084 1.348521 2.941491
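
For each predictor j, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A sketch on synthetic data (not the Boston set):

```r
# VIF by hand: regress one predictor on the others
set.seed(7)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(100)                 # independent

r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # well above 5: x1 is highly collinear with x2

r2_x3 <- summary(lm(x3 ~ x1 + x2))$r.squared
1 / (1 - r2_x3)  # near 1: x3 is not collinear
```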

A correlation matrix was built to visualize the relationships between the variables. It indicates notable correlations between the following pairs of predictors:

  • rm and lstat
  • rad and lstat
  • rad and dis
  • rad and tax
  • nox and dis
  • tax and lstat
  • tax and crim
corr_matrix <- cor(Boston[, -4])
corrplot.mixed(corr_matrix, lower.col = "black", number.cex = .7, order = "AOE")

mod_2_vif <- update(full_model, . ~ . - tax - rad - indus - age)
vif(mod_2_vif)
##     crim       zn     chas      nox       rm      dis  ptratio    black 
## 1.476281 2.119038 1.051879 3.034316 1.774261 3.410535 1.395503 1.302455 
##    lstat 
## 2.577927

We can observe that the VIF values of the remaining predictors have dropped significantly. The variables ‘nox’ and ‘dis’ still have relatively large VIFs, but we will not remove them from the model for now.

IV. Correction of the optimal model

Taking into account the data obtained during the analysis of the data and diagnostics of the model, the following corrections can be made:

  • adding I(lstat^2), since this predictor’s relationship with medv is nonlinear
  • excluding the predictors ‘tax’ and ‘rad’, as they contribute greatly to multicollinearity
  • adding influential interactions between predictors to the model
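
As a reminder of how R handles the last point: an `a * b` term in a formula expands to both main effects plus their product. A tiny sketch with an illustrative data frame:

```r
# rm * ptratio expands to main effects plus the interaction column
d <- data.frame(rm = c(6, 7, 8), ptratio = c(15, 18, 20))
colnames(model.matrix(~ rm * ptratio, data = d))
# "(Intercept)" "rm" "ptratio" "rm:ptratio"
```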

These corrections, including the elimination of the ‘tax’ and ‘rad’ variables, increased the adjusted R-squared to 0.7724.

model_3 <- lm(formula = medv ~ . + I(lstat^2) - indus - age - tax - rad, data = Boston)
summary(model_3)
## 
## Call:
## lm(formula = medv ~ . + I(lstat^2) - indus - age - tax - rad, 
##     data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.654  -2.586  -0.337   1.750  25.985 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.517571   4.462481   7.959 1.19e-14 ***
## crim         -0.101653   0.027860  -3.649 0.000292 ***
## zn            0.019018   0.012390   1.535 0.125425    
## chas1         2.853024   0.788663   3.618 0.000328 ***
## nox         -11.464462   2.969327  -3.861 0.000128 ***
## rm            3.543924   0.374807   9.455  < 2e-16 ***
## dis          -1.306845   0.171681  -7.612 1.38e-13 ***
## ptratio      -0.660814   0.107932  -6.123 1.88e-09 ***
## black         0.007027   0.002444   2.875 0.004214 ** 
## lstat        -1.690324   0.121115 -13.956  < 2e-16 ***
## I(lstat^2)    0.033616   0.003256  10.324  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.388 on 495 degrees of freedom
## Multiple R-squared:  0.7769, Adjusted R-squared:  0.7724 
## F-statistic: 172.3 on 10 and 495 DF,  p-value: < 2.2e-16

We also excluded the predictor zn from the model, since it makes no significant contribution.

model_4 <- lm(formula = medv ~ . + I(lstat^2) - indus - age - tax - rad - zn, data = Boston)

summary(model_4)
## 
## Call:
## lm(formula = medv ~ . + I(lstat^2) - indus - age - tax - rad - 
##     zn, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.753  -2.617  -0.333   1.750  26.172 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.809520   4.464517   8.021 7.62e-15 ***
## crim         -0.096228   0.027673  -3.477 0.000551 ***
## chas1         2.828852   0.789583   3.583 0.000374 ***
## nox         -11.399561   2.973082  -3.834 0.000142 ***
## rm            3.608807   0.372924   9.677  < 2e-16 ***
## dis          -1.175195   0.148926  -7.891 1.93e-14 ***
## ptratio      -0.704267   0.104296  -6.753 4.07e-11 ***
## black         0.006909   0.002446   2.824 0.004928 ** 
## lstat        -1.720591   0.119663 -14.379  < 2e-16 ***
## I(lstat^2)    0.034516   0.003207  10.761  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.394 on 496 degrees of freedom
## Multiple R-squared:  0.7758, Adjusted R-squared:  0.7717 
## F-statistic: 190.7 on 9 and 496 DF,  p-value: < 2.2e-16

Adding interactions between predictors increased the proportion of variance explained by the model (adjusted R-squared: 0.8254).

model_5 <- lm(formula = medv ~ . + I(lstat^2) - indus - lstat - age - zn - tax - rad + rm*lstat + rm*ptratio + nox*dis + dis*lstat + crim*lstat, data = Boston)
summary(model_5)
## 
## Call:
## lm(formula = medv ~ . + I(lstat^2) - indus - lstat - age - zn - 
##     tax - rad + rm * lstat + rm * ptratio + nox * dis + dis * 
##     lstat + crim * lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6254  -2.1439  -0.4422   1.7473  23.9445 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -90.600621  13.389910  -6.766 3.77e-11 ***
## crim          0.158823   0.080433   1.975 0.048874 *  
## chas1         2.347801   0.701124   3.349 0.000875 ***
## nox           7.012868   4.552769   1.540 0.124119    
## rm           21.541929   1.920772  11.215  < 2e-16 ***
## dis           0.878470   0.689294   1.274 0.203107    
## ptratio       4.991722   0.716550   6.966 1.05e-11 ***
## black         0.004126   0.002199   1.876 0.061183 .  
## I(lstat^2)    0.026517   0.004749   5.584 3.90e-08 ***
## lstat        -0.368343   0.414894  -0.888 0.375084    
## rm:lstat     -0.230808   0.048441  -4.765 2.49e-06 ***
## rm:ptratio   -0.853330   0.111020  -7.686 8.32e-14 ***
## nox:dis      -6.186074   1.747484  -3.540 0.000438 ***
## dis:lstat     0.105737   0.020530   5.150 3.77e-07 ***
## crim:lstat   -0.013678   0.003794  -3.605 0.000344 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.844 on 491 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8254 
## F-statistic: 171.5 on 14 and 491 DF,  p-value: < 2.2e-16

This model includes interactions between predictors with high correlation coefficients, but recall that the variables ‘nox’ and ‘dis’ have high VIF values, and they no longer contribute significantly to the model. We removed them and checked whether the R-squared value dropped significantly.

model_6 <- lm(formula = medv ~ . + I(lstat^2) - indus - age - zn - tax - rad - nox - dis + rm*ptratio + crim*lstat, data = Boston)

summary(model_6)
## 
## Call:
## lm(formula = medv ~ . + I(lstat^2) - indus - age - zn - tax - 
##     rad - nox - dis + rm * ptratio + crim * lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8995  -2.3284  -0.5856   1.8051  27.1191 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.110e+02  1.359e+01  -8.165 2.68e-15 ***
## crim         2.809e-01  8.365e-02   3.358 0.000846 ***
## chas1        3.616e+00  7.474e-01   4.838 1.76e-06 ***
## rm           2.423e+01  2.056e+00  11.785  < 2e-16 ***
## ptratio      6.676e+00  7.354e-01   9.079  < 2e-16 ***
## black        7.039e-03  2.319e-03   3.035 0.002533 ** 
## lstat       -1.596e+00  1.181e-01 -13.511  < 2e-16 ***
## I(lstat^2)   3.504e-02  3.619e-03   9.681  < 2e-16 ***
## rm:ptratio  -1.140e+00  1.137e-01 -10.020  < 2e-16 ***
## crim:lstat  -1.905e-02  3.989e-03  -4.777 2.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.204 on 496 degrees of freedom
## Multiple R-squared:  0.7948, Adjusted R-squared:  0.7911 
## F-statistic: 213.5 on 9 and 496 DF,  p-value: < 2.2e-16

We can see that the adjusted R-squared dropped by about 3 percentage points, while the multicollinear variables were removed.

Diagnostics

After correcting the model, we ran the diagnostics again. First, let us note the issues we could not resolve while adjusting the model:

  • we could not remove the nonlinearity of the relationship between the predictors and the dependent variable (except for lstat)
  • we could not get rid of heteroscedasticity (see the residuals plot), meaning prediction errors differ across ranges of predicted values
  • we did not discard outliers from the data, since we do not know whether they carry real information or are data-collection errors; accordingly, outliers are still visible in the residuals and Cook’s distance plots

Below are the residuals plots for the original model and the adjusted one. No improvement is apparent.

model_6_diag <- data.frame(fortify(model_6), Boston[, c(1, 4,6, 11, 12, 13)])

gg_medv <- ggplot(data = model_6_diag, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")

gg_medv_2
## `geom_smooth()` using formula 'y ~ x'

gg_medv
## `geom_smooth()` using formula 'y ~ x'

ols_plot_cooksd_bar(model_2)

ols_plot_cooksd_bar(model_6)

Below are the quantile plots of the linear model before and after correction. The distribution has not improved and still deviates from the ideal normal distribution.
qqPlot(model_2_diag$.stdresid)

## [1] 369 372
qqPlot(model_6_diag$.stdresid)

## [1] 372 373

Nevertheless, we managed to increase the R-squared of the model and make it somewhat better aligned with the applicability conditions:

  • we accounted for the nonlinear relationship between the dependent variable and lstat by adding its square for a better fit
  • we added interactions between predictors
  • we removed the variables with the largest VIFs; for the remaining variables the VIF is below 4
model_6_stand <- lm(formula = medv ~ . + I(lstat^2) - indus - age - zn - tax - rad - nox - dis + rm*ptratio + crim*lstat, data = bost_stand)
vif(model_6_stand)
##       crim         rm    ptratio      black      lstat I(lstat^2) rm:ptratio 
##   3.204936   1.831893   1.310993   1.279215   3.165839   2.403179   1.225028 
## crim:lstat 
##   3.431738

V. Prediction

So, our final multiple linear model predicting the median value of homes in Boston includes the crime rate by town (crim), the average number of rooms per dwelling (rm), the pupil-teacher ratio by town (ptratio), the percent of lower-status population (lstat) and the black proportion statistic (black), together with the rm:ptratio and crim:lstat interactions. The model also includes the squared lstat term, since lstat contributes most to the predictive power of the model and has a skewed distribution and a nonlinear relationship with the dependent variable. Using the coefficients from the summary below:

medv = -107.3 - 1.60·lstat + 0.0357·lstat^2 + 0.31·crim + 23.8·rm + 6.44·ptratio + 0.0075·black - 1.11·rm:ptratio - 0.021·lstat:crim

An artificial dataset was created for model prediction. The variable lstat was chosen as the target predictor. Thus, the prediction plot built on this dataset shows how the dependent variable changes as lstat changes, with all other variables held at their mean values.

We see that the prediction curve of our model is nonlinear, which casts doubt on a purely linear interpretation of our model. We also built a simple linear model with lstat as the only predictor; its prediction plot has a linear form.

model_final <- lm(formula = medv ~ lstat + I(lstat^2) + crim + rm + ptratio + black + rm*ptratio + crim*lstat, data = Boston)
summary(model_final)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2) + crim + rm + ptratio + 
##     black + rm * ptratio + crim * lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5864  -2.4594  -0.5182   1.8857  26.8633 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.073e+02  1.387e+01  -7.732 5.94e-14 ***
## lstat       -1.603e+00  1.207e-01 -13.276  < 2e-16 ***
## I(lstat^2)   3.568e-02  3.698e-03   9.650  < 2e-16 ***
## crim         3.092e-01  8.531e-02   3.624  0.00032 ***
## rm           2.380e+01  2.100e+00  11.331  < 2e-16 ***
## ptratio      6.438e+00  7.501e-01   8.583  < 2e-16 ***
## black        7.488e-03  2.369e-03   3.161  0.00167 ** 
## rm:ptratio  -1.110e+00  1.161e-01  -9.558  < 2e-16 ***
## lstat:crim  -2.053e-02  4.066e-03  -5.051 6.20e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.297 on 497 degrees of freedom
## Multiple R-squared:  0.7851, Adjusted R-squared:  0.7817 
## F-statistic:   227 on 8 and 497 DF,  p-value: < 2.2e-16
# Dataset for model prediction
MyData <- data.frame(
  lstat = seq(min(Boston$lstat), max(Boston$lstat), length.out = 100),
  ptratio = mean(Boston$ptratio),
  crim = mean(Boston$crim),
  rm = mean(Boston$rm),
  black = mean(Boston$black))



# Predicted values
Predictions <- predict(model_final, newdata = MyData,  interval = 'confidence')
MyData <- data.frame(MyData, Predictions)

# Model prediction plot
Pl_predict <- ggplot(MyData, aes(x = lstat, y = fit)) +
  geom_ribbon(alpha = 0.2, aes(ymin = lwr, ymax = upr)) +
  geom_line() + 
  ggtitle("Multiple model")
Pl_predict 

simple_model <- lm(formula = medv ~ lstat, data = Boston)
summary(simple_model)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# Dataset for model prediction
s_m_newdata <- data.frame(lstat = seq(min(Boston$lstat), max(Boston$lstat), length.out = 100)) 

medv_predicted <- predict(simple_model, newdata = s_m_newdata, interval = "confidence") 
simple_data <- data.frame(s_m_newdata, medv_predicted) 


simple_predict <- ggplot(simple_data, aes(x = lstat, y = fit)) +
  geom_line() +
  geom_ribbon(alpha = 0.2, aes(ymin = lwr, ymax = upr)) + 
  ggtitle("Simple model")

grid.arrange(Pl_predict, simple_predict, ncol = 2)

We briefly checked the simple linear model against the applicability conditions:

  • the residuals plot indicates heteroscedasticity; this was expected, since medv depends on lstat in a curved, roughly hyperbolic fashion
  • the Cook’s distance plot indicates the presence of outliers
  • the quantile plot shows that the distribution differs from normal

summary(simple_model)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
simple_model_diag <- data.frame(fortify(simple_model), Boston)

gg_medv_simp <- ggplot(data = simple_model_diag, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")

gg_medv_simp
## `geom_smooth()` using formula 'y ~ x'

qqPlot(simple_model_diag$.stdresid)

## [1] 372 373
ols_plot_cooksd_bar(simple_model)

Summary: We have fitted a linear model that explains about 79% of the variance in the median value of Boston homes (adjusted R-squared 0.79). However, it is far from fully satisfying the applicability conditions of a multiple linear model, which should be kept in mind when interpreting it. Nevertheless, with real data it is usually difficult to find a model that meets all the applicability conditions. As a result, the regression model we built is nonlinear in lstat.

We also built a multiple linear model without the correction for the nonlinear relationship between lstat and medv. We had to exclude crim and its interaction with lstat from this model, since they lost their significance. The adjusted R-squared of this model is about 74%, while for the simple linear model it is only 54%. The prediction plot of this model has a linear form.

model_minimal <- lm(formula = medv ~ lstat + rm + ptratio + black + rm*ptratio, data = Boston)

# Dataset for model prediction
MyData_min <- data.frame(
  lstat = seq(min(Boston$lstat), max(Boston$lstat), length.out = 100),
  ptratio = mean(Boston$ptratio),
  rm = mean(Boston$rm),
  black = mean(Boston$black))
# Predicted values
Predictions_min <- predict(model_minimal, newdata = MyData_min,  interval = 'confidence')
MyData_min <- data.frame(MyData_min, Predictions_min)

# Model prediction plot
Pl_predict_min <- ggplot(MyData_min, aes(x = lstat, y = fit)) +
  geom_ribbon(alpha = 0.2, aes(ymin = lwr, ymax = upr)) +
  geom_line() + 
  ggtitle("Multiple model")
grid.arrange(Pl_predict_min, simple_predict, ncol = 2)

The diagnostics of this model also indicates that it does not meet the conditions of applicability.

summary(model_minimal)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + rm * ptratio, 
##     data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4767  -2.5182  -0.8243   1.5207  28.0895 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.291e+02  1.443e+01  -8.948  < 2e-16 ***
## lstat       -5.441e-01  4.064e-02 -13.389  < 2e-16 ***
## rm           2.669e+01  2.196e+00  12.152  < 2e-16 ***
## ptratio      7.079e+00  7.931e-01   8.926  < 2e-16 ***
## black        9.546e-03  2.493e-03   3.829 0.000145 ***
## rm:ptratio  -1.241e+00  1.223e-01 -10.151  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.703 on 500 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7385 
## F-statistic: 286.2 on 5 and 500 DF,  p-value: < 2.2e-16
# Diagnostic data frame: fortified model plus the original chas, rm, ptratio, black and lstat columns
model_min_diag <- data.frame(fortify(model_minimal), Boston[, c(4, 6, 11, 12, 13)])

gg_medv_min <- ggplot(data = model_min_diag, aes(x = .fitted, y = .stdresid)) + 
  geom_point() + 
  geom_hline(yintercept = 0) +
  geom_smooth(method = "lm") +
  geom_hline(yintercept = 2, color = "red") +
  geom_hline(yintercept = -2, color = "red")

gg_medv_min
## `geom_smooth()` using formula 'y ~ x'
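The red lines at ±2 mark the usual cut-off for standardized residuals. As a quick numeric check (a sketch assuming the `model_min_diag` data frame created above), the share of observations outside that band can be computed directly:

```r
# Under approximate normality, roughly 5% of standardized residuals
# should fall outside the [-2, 2] band; a noticeably larger share
# signals problems with the model
mean(abs(model_min_diag$.stdresid) > 2)
```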

qqPlot(model_min_diag$.stdresid)

## [1] 373 372
ols_plot_cooksd_bar(model_minimal)

model_minimal_stand <- lm(formula = medv ~ lstat + rm + ptratio + black + rm * ptratio, data = bost_stand)
vif(model_minimal_stand)
##      lstat         rm    ptratio      black rm:ptratio 
##   1.922771   1.787598   1.237425   1.182608   1.134234
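For reference, a VIF can be reproduced by hand: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining model terms. A sketch assuming the standardized `bost_stand` data frame used above:

```r
# Manual VIF for lstat: regress it on the other terms of the model
# (rm, ptratio, black and the rm:ptratio interaction)
r2_lstat <- summary(lm(lstat ~ rm + ptratio + black + rm:ptratio,
                       data = bost_stand))$r.squared
1 / (1 - r2_lstat)  # should be close to vif()'s value for lstat above (~1.92)
```

All values here are well below the common rule-of-thumb threshold of 5, so multicollinearity is not a concern for this specification.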

Conclusion.

In this study, we selected a multiple linear model to predict the value of houses in Boston from various predictors. We built a simple linear model as well as several multiple ones: a full model including all predictors; an optimal model with a correction for the non-linear relationship between the predictor lstat and the dependent variable; and an optimal model without this correction.

None of these models fully satisfies the applicability conditions. We consider the Boston data unsuitable for fitting a formally correct linear model; nevertheless, should the customer require one, we can offer the following assessment of each multiple model.

Full model. The full linear model includes all predictors and explains 73% of the variance in the dependent variable. We do not recommend using this model to predict house values, as it:

* includes variables that do not contribute significantly to explaining the variability of the dependent variable;
* includes collinear predictors;
* has the smallest R-squared of the three models.

Optimal model with the non-linearity correction. This model has the largest R-squared and explains 79% of the variance in the dependent variable. Predictors that did not contribute significantly, as well as those exhibiting multicollinearity, were excluded from it. Its advantage is that it accounts for the non-linear relationship between lstat and medv; the drawback is that its prediction curve is non-linear, meaning the model's behaviour changes with the values of the predictors, which is harder to take into account. It also includes quite a few predictors, all of which the customer would have to consider.

Optimal model without I(lstat^2). The R-squared of this model is 4 percentage points lower than that of the previous one, but it includes only the predictors most relevant for prediction, and its prediction curve is linear on the data.


The client was also interested in what the ideal neighborhood should look like and which aspects deserve attention in order to maximize the price of a house.

Based on the linear model, we first of all advise paying attention to the percentage of lower-status population: the customer should choose an area where it is minimal, and such areas also tend to have less crime. The model further shows that the more rooms a house has, the higher its value. The pupil-teacher ratio affects the price as well: the customer should prefer areas with more elite or private schools, since such areas are considered more prestigious, which additionally raises home values.

In short, to build expensive houses the customer should select areas with a minimal share of lower-status residents (such areas will also be safer) and with private or elite schools; a large number of rooms will further increase the value of the house.
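This advice can be illustrated numerically with the minimal model by comparing a favorable neighborhood profile (low lstat, many rooms, low pupil-teacher ratio) against an unfavorable one. A sketch assuming the `model_minimal` object fitted above; the two profile rows are hypothetical illustration values, not observed towns:

```r
# Two hypothetical neighborhood profiles: favorable vs unfavorable
profiles <- data.frame(
  lstat   = c(5, 25),               # low vs high share of lower-status population
  rm      = c(7.5, 5),              # many vs few rooms per dwelling
  ptratio = c(14, 21),              # low vs high pupil-teacher ratio
  black   = rep(mean(Boston$black), 2))

# Point predictions with prediction intervals; the favorable profile
# should yield a markedly higher predicted medv
predict(model_minimal, newdata = profiles, interval = "prediction")
```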